<img src="https://github.com/caufieldjh/awesome-bioie/blob/master/images/abie_head.png" alt="Awesome BioIE Logo"/>
<br>
<a href="https://awesome.re">
    <img src="https://awesome.re/badge-flat2.svg" alt="Awesome">
</a>
<br>
How to extract information from unstructured biomedical data and text.
<br>
     
    
      What is BioIE? It includes any effort to extract structured information
      from unstructured (or at least inconsistently structured)
      biological, clinical, or other biomedical data. The data source is often
      some collection of text documents written in technical language. If the
      resulting information is verifiable and consistent across sources, we may
      then consider it knowledge. Extracting information and producing
      knowledge from biomedical data requires adapting methods developed for
      other types of unstructured data.
    
    
      Resources included here are preferentially those available at no monetary
      cost and with limited license requirements. Methods and datasets should be
      publicly accessible and actively maintained.
    
    
      See also awesome-nlp,
      awesome-biology
      and
      Awesome-Bioinformatics.
    
    
      Please read the
        contribution guidelines before
        contributing. Please add your favourite resource by raising a
        pull request.
    
    Contents
    
    Research Overviews
    
    Back to Top
    Groups Active in the Field
    
      - 
        Boston Children’s Hospital Natural Language Processing Laboratory
        - Led by Dr. Guergana Savova, formerly at Mayo Clinic and the Apache
        cTAKES project.
      
 
      - 
        BD2K - The U.S. National
        Institutes of Health (NIH) funded 13 Centers of Excellence through their
        Big Data to Knowledge (BD2K) program, several of which developed tools
        and resources for BioIE.
        
          - 
            HeartBD2K - Based at
            University of California, Los Angeles (UCLA). Led by Dr. Peipei
            Ping.
          
 
          - 
            KnowEng - Based at
            University of Illinois at Urbana-Champaign (UIUC). Led by Dr. Jiawei
            Han.
          
 
          - 
            Mobilize - Based at
            Stanford. Led by Dr. Scott Delp.
          
 
        
       
      - 
        Brown Center for Biomedical Informatics
        - Based at Brown University and directed by Dr. Neil Sarkar, whose
        research group works on topics in clinical NLP and IE.
      
 
      - 
        Center for Computational Pharmacology NLP Group
        - based at University of Colorado, Denver and led by Larry Hunter -
        see their GitHub repos here.
      
 
      - 
        Groups at U.S. National Institutes of Health (NIH) / National Library of
        Medicine (NLM):
        
      
 
      - 
        JensenLab - Based at the Novo
        Nordisk Foundation Center for Protein Research at the University of
        Copenhagen, Denmark.
      
 
      - 
        National Centre for Text Mining (NaCTeM)
        - Based at the University of Manchester and led by Prof. Sophia
        Ananiadou, NaCTeM is concerned with text mining in general but has a
        particular focus on biomedical applications.
      
 
      - 
        Mayo Clinic’s clinical natural language processing program
        - Several groups at Mayo Clinic have made major contributions to BioIE
        (for example, the Apache cTAKES platform) over the past 20 years.
      
 
      - 
        Monarch Initiative - A
        joint effort between groups at Oregon State University, Oregon Health
        & Science University, Lawrence Berkeley National Lab, The Jackson
        Laboratory, and several others, seeking to “integrate biological
        information using semantics, and present it in a novel way, leveraging
        phenotypes to bridge the knowledge gap”.
      
 
      - 
        TurkuNLP - Based at the University
        of Turku and concerned with NLP in general with a focus on BioNLP and
        clinical applications.
      
 
      - 
        UTHealth Houston Biomedical Natural Language Processing Lab
        - Based in the University of Texas Health Science Center at Houston,
        School of Biomedical Informatics and led by Dr. Hua Xu.
      
 
      - 
        VCU Natural Language Processing Lab
        - Based at Virginia Commonwealth University and led by Dr. Bridget
        McInnes.
      
 
      - 
        Zaklab - Group led by Dr. Isaac Kohane
        at Harvard Medical School’s Department of Biomedical Informatics
        (Dr. Kohane is also a steward of the n2c2 (formerly i2b2) datasets - see
        Datasets below).
      
 
      - 
        Columbia University Department of Biomedical Informatics
        - Led by Drs. George Hripcsak and Noémie Elhadad.
      
 
    
    Back to Top
    Organizations
    
      - 
        AMIA - Many—but certainly not
        all—individuals studying biomedical informatics are members of the
        American Medical Informatics Association. AMIA publishes a journal,
        JAMIA (see below).
      
 
      - 
        IMIA - The International Medical
        Informatics Association. Publishes the IMIA Yearbook of Medical
        Informatics.
      
 
    
    Back to Top
    Journals and Events
    
      The interdisciplinary nature of BioIE means researchers in this space may
      share their findings and tools in a variety of ways. They may publish
      papers in journals, as is common in the biomedical and life sciences. They
      may publish conference papers and, upon acceptance, give a poster and/or
      oral presentation at an event; this is common practice in computer science
      and engineering fields. Conference papers are often published in
      collections of proceedings. Preprint publication is an increasingly
      popular and institutionally-accepted way to publish findings as well.
      Surrounding these formal, written products are the ideas of
      open science,
      open data, and open source: the code, data, and software BioIE researchers
      develop are valuable resources to the community.
    
    Journals
    
      For preprints, try arXiv, especially the
      subjects Computation and Language (cs.CL) and Information Retrieval
      (cs.IR); bioRxiv; or
      medRxiv, especially the Health
      Informatics subject area.
    
    
      - 
        Database - Its subtitle
        is “The Journal of Biological Databases and Curation”. Open access.
      
 
      - 
        NAR - Nucleic Acids Research.
        Has a broad biomolecular focus but is particularly notable for its
        annual database issue.
      
 
      - 
        JAMIA - The Journal of the
        American Medical Informatics Association. Concerns “articles in the
        areas of clinical care, clinical research, translational science,
        implementation science, imaging, education, consumer health, public
        health, and policy”.
      
 
      - 
        JBI
        - The Journal of Biomedical Informatics. Not open access by default,
        though it does have an open-access “X” version.
      
 
      - 
        Scientific Data - An
        open-access Springer Nature journal publishing “descriptions of
        scientifically valuable datasets, and research that advances the sharing
        and reuse of scientific data”.
      
 
    
    Conferences and Other Events
    
      - 
        ACM-BCB - The ACM Conference on
        Bioinformatics, Computational Biology, and Health Informatics. Held
        annually since 2010.
      
 
      - 
        BIBM - The IEEE
        International Conference on Bioinformatics and Biomedicine.
      
 
      - 
        ISMB - The International
        Conference on Intelligent Systems for Molecular Biology is an annual
        conference hosted by the International Society for Computational Biology
        since 1993. Much of its focus has concerned bioinformatics and
        computational biology without an explicit clinical focus, though it has
        included an increasing amount of text mining content (e.g., the 2019
        meeting included a
        full-day special session on Text Mining for Biology and Healthcare). The meeting is combined with that of the European Conference on
        Computational Biology (ECCB) in odd-numbered years.
      
 
      - 
        PSB - The Pacific Symposium on
        Biocomputing.
      
 
    
    Challenges
    
      Some events in BioIE are organized around formal tasks and challenges in
      which groups develop their own computational solutions, given a dataset.
    
    
      - 
        BioASQ - Challenges on biomedical
        semantic indexing and question answering. Challenges and workshops held
        annually since 2013.
      
 
      - 
        BioCreAtIvE workshop
        - These workshops have been organized since 2004, with BioCreative VI
        held in February 2017 and the
        BioCreative/OHNLP Challenge
        held in 2018. See Datasets below.
      
 
      - 
        SemEval workshop - Tasks
        and evaluations in computational semantic analysis. Tasks vary by year
        but frequently cover scientific and/or biomedical language, e.g. the
        SemEval-2019 Task 12 on Toponym Resolution in Scientific Papers.
      
 
      - 
        eHealth-KD
        - Challenges for encouraging “development of software technologies to
        automatically extract a large variety of knowledge from eHealth
        documents written in the Spanish Language”. Previously held as part of
        TASS, an annual
        workshop for semantic analysis in Spanish.
      
 
      - 
        EHR DREAM Challenge
        - Held along with several other
        more bioinformatics-focused challenges, this challenge opened in October 2019 and focuses on using electronic
        health record data to predict patient mortality. Uses a synthetic data
        set rather than real EHR contents.
      
 
    
    Back to Top
    Tutorials
    
      The field changes rapidly enough that tutorials any older than a few years
      are missing crucial details. A few more recent educational resources are
      listed below. A good foundational understanding of text mining techniques
      is very helpful, as is some basic experience with the Python and/or R
      languages. Starting with the
      NLTK tutorials and then trying
      out the tutorials for the
      Flair framework
      will provide excellent examples of natural language processing, text
      mining, and modern machine learning-driven methods, all in Python. Most of
      the examples don’t include anything biomedical, however, so the best
      option may be to learn by doing.
    
    Guides
    
    
      Video Lectures and Online Courses
    
    
    Back to Top
    Code Libraries
    
      - 
        Biopython -
        paper -
        code - Python tools
        primarily intended for bioinformatics and computational molecular
        biology purposes, but also a convenient way to obtain data, including
        documents/abstracts from PubMed (see Chapter 9 of the documentation).
      
 
      - 
        Bio-SCoRes -
        paper
        - A framework for biomedical coreference resolution.
      
 
      - 
        medaCy - A system for
        building predictive medical natural language processing models. Built on
        the spaCy framework.
      
 
      - 
        ScispaCy -
        paper - A version of the
        spaCy framework for scientific and
        biomedical documents.
      
 
      - 
        rentrez - R utilities
        for accessing NCBI resources, including PubMed.
      
 
      - 
        Med7
        - paper -
        code - a Python
        package and model (for use with spaCy) for doing NER with
        medication-related concepts.
      
 
    
    Repos for Specific Datasets
    
      - 
        mimic-code - Code
        associated with the MIMIC-III dataset (see below). Includes some helpful
        tutorials.
      
 
    
    Back to Top
    
    
      - 
        cTAKES -
        paper
        - code - A system for
        processing the text in electronic medical records. Widely used and open
        source.
      
 
      - 
        CLAMP -
        paper
        - A natural language processing toolkit intended for use with the text
        in clinical reports. Check out their
        live demo first to see
        what it does. Usable at no cost for academic research.
      
 
      - 
        DeepPhe - A
        system for processing documents describing cancer presentations. Based
        on cTAKES (see above).
      
 
      - 
        DNorm
        -
        paper
        - A method for disease normalization, i.e., linking mentions of disease
        names and acronyms to unique concept identifiers. Downloadable version
        includes the NCBI Disease Corpus and BC5CDR (see Annotated Text Data
        below).
      
 
      - 
        PubTator Central
        -
        paper
        - A web platform that identifies five different types of biomedical
        concepts in PubMed articles and PubMed Central full texts. The full
        annotation sets are downloadable (see
        Annotated Text Data below).
      
 
      - 
        Pubrunner - A
        framework for running text mining tools on the newest set(s) of
        documents from PubMed.
      
 
      - 
        SemEHR -
        paper
        - an IE infrastructure for electronic health records (EHR). Built on the
        CogStack project.
      
 
      - 
        TaggerOne
        -
        paper
        - Performs concept normalization (see also DNorm above). Can be trained
        for specific concept types and can perform NER independent of other
        normalization functions.
      
 
      - 
        TabInOut -
        paper
        - a framework for IE from tables in the literature.
      
 
    
    
    
      - 
        Anafora -
        paper
        - An annotation tool with adjudication and progress tracking features.
      
 
      - 
        brat -
        paper -
        code - The brat rapid
        annotation tool. Supports producing text annotations visually, through
        the browser. Not subject specific; appropriate for many annotation
        projects. Visualization is based on that of the
        stav tool.
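
Annotations made in brat are stored in a plain-text standoff format (`.ann` files), which is straightforward to read downstream. The following is a minimal sketch, assuming only text-bound (`T`) and binary relation (`R`) lines are of interest; the sample annotations and the `Chemical`, `Disease`, and `Treats` labels are invented for illustration and are not taken from any listed corpus.

```python
from typing import NamedTuple


class TextBound(NamedTuple):
    ann_id: str
    label: str
    spans: list  # list of (start, end) character offsets
    text: str


class Relation(NamedTuple):
    ann_id: str
    label: str
    args: dict  # role name -> id of the referenced annotation


def parse_ann(content):
    """Parse text-bound and relation lines from brat standoff text."""
    entities, relations = [], []
    for line in content.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        ann_id = fields[0]
        if ann_id.startswith("T") and len(fields) >= 3:
            # "Label start end", with ";" separating discontinuous fragments
            label, _, offsets = fields[1].partition(" ")
            spans = [tuple(int(n) for n in frag.split())
                     for frag in offsets.split(";")]
            entities.append(TextBound(ann_id, label, spans, fields[2]))
        elif ann_id.startswith("R") and len(fields) >= 2:
            # "Label Arg1:T1 Arg2:T2"
            label, *arg_tokens = fields[1].split()
            args = dict(token.split(":", 1) for token in arg_tokens)
            relations.append(Relation(ann_id, label, args))
    return entities, relations


# Hypothetical annotations over the text "aspirin relieves headache"
SAMPLE_ANN = (
    "T1\tChemical 0 7\taspirin\n"
    "T2\tDisease 17 25\theadache\n"
    "R1\tTreats Arg1:T1 Arg2:T2\n"
)

entities, relations = parse_ann(SAMPLE_ANN)
```

Other annotation types (events, attributes, notes) are skipped here; a fuller reader would need to handle those line prefixes as well.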
      
 
    
    Back to Top
    Techniques
    Text Embeddings
    
      This paper from Hongfang Liu’s group at Mayo Clinic
      demonstrates how text embeddings trained on biomedical or clinical text
      can, but don’t always, perform better on biomedical natural language
      processing tasks. That being said, pre-trained embeddings may be
      appropriate for your needs, especially as training domain-specific
      embeddings can be computationally intensive.
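
Whichever embeddings you choose, they are typically compared with cosine similarity. The following is a minimal sketch using only the standard library; the three-dimensional vectors are toy values invented for illustration and do not come from any of the models listed below.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def most_similar(word, embeddings):
    """Rank every other word by cosine similarity to `word`, best first."""
    target = embeddings[word]
    scores = [
        (other, cosine_similarity(target, vector))
        for other, vector in embeddings.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy embeddings; real models use hundreds of dimensions.
toy_embeddings = {
    "myocardial": [0.9, 0.1, 0.0],
    "cardiac": [0.8, 0.2, 0.1],
    "protein": [0.0, 0.3, 0.9],
}

ranking = most_similar("myocardial", toy_embeddings)
```

Running the same ranking with general-domain and domain-specific vectors on a handful of terms you care about is a quick, informal way to judge whether a biomedical model is worth the extra cost.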
    
    Word Embeddings
    
      - 
        BioASQword2vec
        - paper - Word
        embeddings derived from biomedical text (>10 million PubMed
        abstracts) using the popular
        word2vec tool.
      
 
      - 
        BioWordVec
        -
        paper -
        code - Word
        embeddings derived from biomedical text (>27 million PubMed titles
        and abstracts), including subword embedding model based on MeSH.
      
 
    
    Language Models
    
      - 
        BioBERT -
        paper -
        code - A PubMed and
        PubMed Central-trained version of the
        BERT language model.
      
 
      - 
        ClinicalBERT - Two language models trained on clinical text have similar
        names. Both are BERT models trained on the text of clinical notes from
        the MIMIC-III dataset.
        
      
 
      - 
        Flair embeddings from PubMed
        - A language model available through the Flair framework and embedding
        method. Trained over a 5% sample of PubMed abstracts until 2015, or
        >1.2 million abstracts in total.
      
 
      - 
        SciBERT -
        paper - A BERT model
        trained on >1M papers from the Semantic Scholar database.
      
 
      - 
        BlueBERT -
        paper - A BERT model
        pre-trained on PubMed text and MIMIC-III notes.
      
 
      - 
        PubMedBERT -
        paper - A BERT model
        trained from scratch on PubMed, with versions trained on abstracts+full
        texts and on abstracts alone.
      
 
    
    Back to Top
    Datasets
    
      Some of the datasets listed below require a
      UMLS Terminology Services (UTS) account
      to access. Please note that the license granted with the UTS account
      requires users to submit an annual report about their use of UMLS
      resources. This is less challenging than it sounds.
    
    Biomedical Text Sources
    
      The following resources contain indexed text documents in the biomedical
      sciences.
    
    
      - 
        OHSUMED -
        paper - 348,566
        MEDLINE entries (title and sometimes abstract) from between 1987 and 1991.
        Includes MeSH labels. Primarily of historical significance.
      
 
      - 
        PubMed Central Open Access Subset
        - A set of PubMed Central articles usable under licenses other than
        traditional copyright, though the exact licenses vary by publication and
        source. Articles are available as PDF and XML.
      
 
      - 
        CORD-19
        - A corpus of scholarly manuscripts concerning COVID-19. Articles are
        primarily from PubMed Central and preprint servers, though the set also
        includes metadata on papers without full-text availability.
      
 
    
    Annotated Text Data
    
      - 
        SPL-ADR-200db
        - paper - A
        pilot dataset containing standardised information, and annotations of
        occurrence in text, about ~5,000 known adverse reactions for 200
        FDA-approved drugs.
      
 
      - 
        BioCreAtIvE 1
        -
        paper
        - 15,000 sentences (10,000 training and 5,000 test) annotated for
        protein and gene names. 1,000 full text biomedical research articles
        annotated with protein names and Gene Ontology terms.
      
 
      - 
        BioCreAtIvE 2
        -
        paper
        - 15,000 sentences (10,000 training and 5,000 test, different from the
        first corpus) annotated for protein and gene names. 542 abstracts linked
        to EntrezGene identifiers. A variety of research articles annotated for
        features of protein–protein interactions.
      
 
      - 
        BioCreAtIvE V CDR Task Corpus (BC5CDR)
        -
        paper
        - 1,500 articles (title and abstract) published in 2014 or later,
        annotated for 4,409 chemicals, 5,818 diseases and 3,116 chemical–disease
        interactions. Requires registration.
      
 
      - 
        BioCreative VI CHEMPROT Corpus
        -
        paper
        - >2,400 articles annotated with chemical-protein interactions of a
        variety of relation types. Requires registration.
      
 
      - 
        CRAFT -
        paper
        - 67 full-text biomedical articles annotated in a variety of ways,
        including for concepts and coreferences. Now on version 3.
      
 
      - 
        n2c2 (formerly i2b2) Data
        - The Department of Biomedical Informatics (DBMI) at Harvard Medical
        School manages data for the National NLP Clinical Challenges and the
        Informatics for Integrating Biology and the Bedside challenges running
        since 2006. They require registration before access and use. Datasets
        include a variety of topics. See the
        list of data challenges
        for individual descriptions.
      
 
      - 
        NCBI Disease Corpus
        -
        paper
        - A corpus of 793 biomedical abstracts annotated with names of diseases
        and related concepts from MeSH and OMIM.
      
 
      - 
        PubTator Central datasets
        -
        paper
        - Accessible through a RESTful API or FTP download. Includes annotations
        for >29 million abstracts and ∼3 million full text documents.
      
 
      - 
        Word Sense Disambiguation (WSD) -
        paper
        - 203 ambiguous words and 37,888 automatically extracted instances of
        their use in biomedical research publications. Requires UTS account.
      
 
      - 
        Clinical Questions Collection
        - also known as CQC or the Iowa collection, these are several thousand
        questions posed by physicians during office visits along with the
        associated answers.
      
 
      - 
        BioNLP ST 2013 datasets - data
        from six shared tasks, though some may not be easily accessible; try the
        CG task set (BioNLP2013CG) for extensive entity and event annotations.
      
 
      - 
        BioScope -
        paper
        - a corpus of sentences from medical and biological documents, annotated
        for negation, speculation, and linguistic scope.
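
Several of the corpora above, including the PubTator Central annotation sets, are commonly distributed in the PubTator text format: a `PMID|t|title` line, a `PMID|a|abstract` line, and tab-separated annotation lines giving offsets, mention, type, and identifier. The following is a minimal sketch of a reader for that layout, assuming the file has already been downloaded; the PMID and text in the sample are invented for illustration.

```python
def parse_pubtator(content):
    """Group PubTator-format lines into one record per document id."""
    docs = {}

    def record(pmid):
        # Create the per-document record on first sight of its id.
        return docs.setdefault(
            pmid, {"title": "", "abstract": "", "annotations": []}
        )

    for line in content.splitlines():
        if not line.strip():
            continue  # blank lines separate documents
        parts = line.split("|", 2)
        if len(parts) == 3 and parts[1] in ("t", "a") and "\t" not in parts[0]:
            key = "title" if parts[1] == "t" else "abstract"
            record(parts[0])[key] = parts[2]
            continue
        fields = line.split("\t")
        if len(fields) >= 5:
            pmid, start, end, mention, concept_type = fields[:5]
            record(pmid)["annotations"].append({
                "start": int(start),
                "end": int(end),
                "mention": mention,
                "type": concept_type,
                # Some annotations carry no normalized identifier.
                "identifier": fields[5] if len(fields) > 5 else "",
            })
    return docs


# Hypothetical document; the PMID and sentences are placeholders.
SAMPLE = (
    "99999999|t|Aspirin for headache\n"
    "99999999|a|We studied aspirin in patients with headache.\n"
    "99999999\t0\t7\tAspirin\tChemical\tMESH:D001241\n"
    "99999999\t12\t20\theadache\tDisease\tMESH:D006261\n"
)

docs = parse_pubtator(SAMPLE)
```

Offsets in this format count characters over the title followed by the abstract, so the first annotation above can be checked directly against the title string.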
      
 
    
    
      Protein-protein Interaction Annotated Corpora
    
    
      Protein-protein interactions are abbreviated as PPI. The following sets
      are available in BioC format.
      The older sets (AIMed, BioInfer, HPRD50, IEPA, and LLL) are available
      courtesy of the
      WBI corpora repository
      and were derived from the original sets by a
      group at Turku University.
    
    
      - 
        AIMed
        - paper - 225
        MEDLINE abstracts annotated for PPI.
      
 
      - 
        BioC-BioGRID
        -
        paper
        - 120 full text articles annotated for PPI and genetic interactions.
        Used in the BioCreative V BioC task.
      
 
      - 
        BioInfer
        -
        paper
        - 1,100 sentences from biomedical research abstracts annotated for
        relationships (including PPI), named entities, and syntactic
        dependencies.
        Additional information and download links are here.
      
 
      - 
        HPRD50
        -
        paper
        - 50 scientific abstracts referenced by the Human Protein Reference
        Database, annotated for PPI.
      
 
      - 
        IEPA
        -
        paper
        - 486 sentences from biomedical research abstracts annotated for pairs
        of co-occurring chemicals, including proteins (hence, PPI annotations).
      
 
      - 
        LLL
        -
        paper
        - 77 sentences from research articles about the bacterium
        Bacillus subtilis, annotated for protein–gene interactions (so,
        fairly close to PPI annotations).
        Additional information is here.
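
Because these sets share the BioC XML layout (collection, document, passage, annotation, relation), one small reader can cover all of them. The following is a minimal sketch using the standard library; the sample document is invented, and the `type` infon key and `Arg1`/`Arg2` roles are assumptions, since infon keys and role names differ between corpora.

```python
import xml.etree.ElementTree as ET

# Hypothetical BioC fragment; real files also carry date and key elements.
SAMPLE_BIOC = """<collection>
  <source>hypothetical example</source>
  <document>
    <id>doc1</id>
    <passage>
      <offset>0</offset>
      <text>ProtA binds ProtB.</text>
      <annotation id="T1">
        <infon key="type">protein</infon>
        <location offset="0" length="5"/>
        <text>ProtA</text>
      </annotation>
      <annotation id="T2">
        <infon key="type">protein</infon>
        <location offset="12" length="5"/>
        <text>ProtB</text>
      </annotation>
      <relation id="R1">
        <infon key="type">PPI</infon>
        <node refid="T1" role="Arg1"/>
        <node refid="T2" role="Arg2"/>
      </relation>
    </passage>
  </document>
</collection>"""


def infon(element, key):
    """Return the text of the infon child with the given key, if any."""
    for child in element.findall("infon"):
        if child.get("key") == key:
            return child.text
    return None


def read_bioc(xml_text):
    """Collect entities and relations for each document in a collection."""
    results = []
    for document in ET.fromstring(xml_text).iter("document"):
        entities = {}
        for ann in document.iter("annotation"):
            location = ann.find("location")
            entities[ann.get("id")] = {
                "text": ann.findtext("text"),
                "type": infon(ann, "type"),
                "offset": int(location.get("offset")),
                "length": int(location.get("length")),
            }
        relations = [
            (infon(rel, "type"), [node.get("refid") for node in rel.findall("node")])
            for rel in document.iter("relation")
        ]
        results.append({
            "id": document.findtext("id"),
            "entities": entities,
            "relations": relations,
        })
    return results


parsed = read_bioc(SAMPLE_BIOC)
```

Before relying on this for a specific corpus, inspect a few of its annotations to see which infon keys it actually uses.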
      
 
    
    Other Datasets
    
      - 
        Columbia Open Health Data -
        paper - A
        database of prevalence and co-occurrence frequencies of conditions,
        drugs, procedures, and patient demographics extracted from electronic
        health records. Does not include original record text.
      
 
      - 
        Comparative Toxicogenomics Database -
        paper
        - A database of manually curated associations between chemicals, gene
        products, phenotypes, diseases, and environmental exposures. Useful for
        assembling ontologies of the related concepts, such as types of
        chemicals.
      
 
      - 
        MIMIC-III -
        paper -
        Deidentified health data from ~60,000 intensive care unit admissions.
        Requires completion of an online training course (CITI training) and
        acceptance of a data use agreement prior to use.
      
 
      - 
        MIMIC-CXR -
        The MIMIC Chest X-Ray database. Contains more than 377,000 radiographic
        images and accompanying free-text radiology reports. As with MIMIC-III,
        requires acceptance of a data use agreement.
      
 
      - 
        UMLS Knowledge Sources
        -
        reference manual
        - A large and comprehensive collection of biomedical terminology and
        identifiers, as well as accompanying tools and scripts. Depending on
        your purposes, the single file MRCONSO.RRF may be sufficient, as this
        file contains unique identifiers and names for all concepts in the UMLS
        Metathesaurus. See also the Ontologies and Controlled Vocabularies
        section below.
      
 
      - 
        MIMIC-IV - An update to
        MIMIC-III’s multimodal patient data, now covering more recent years of
        admissions, plus a new data structure, emergency department records, and
        links to MIMIC-CXR images.
      
 
      - 
        eICU Collaborative Research Database
        - paper - a
        database of observations from more than 200 thousand intensive care unit
        admissions, with consistent structure. Requires registration, training
        course completion, and data use agreement.
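
For the UMLS Knowledge Sources entry above, MRCONSO.RRF is a pipe-delimited text file with one concept name per row and a trailing delimiter on every line. The following is a minimal sketch that collects English names per concept identifier (CUI), assuming the standard 18-column layout; the sample rows use invented identifiers and a made-up `EXAMPLE` source, not real UMLS content.

```python
from collections import defaultdict

# Column order of MRCONSO.RRF in the rich release format.
MRCONSO_COLUMNS = [
    "CUI", "LAT", "TS", "LUI", "STT", "SUI", "ISPREF", "AUI", "SAUI",
    "SCUI", "SDUI", "SAB", "TTY", "CODE", "STR", "SRL", "SUPPRESS", "CVF",
]


def names_by_cui(lines, language="ENG"):
    """Map each CUI to its set of non-suppressed names in one language."""
    names = defaultdict(set)
    for line in lines:
        values = line.rstrip("\n").split("|")
        # zip() drops the empty field produced by the trailing "|".
        row = dict(zip(MRCONSO_COLUMNS, values))
        if row.get("LAT") == language and row.get("SUPPRESS") == "N":
            names[row["CUI"]].add(row["STR"])
    return dict(names)


# Hypothetical rows; every identifier and name here is a placeholder.
SAMPLE_ROWS = [
    "C0000001|ENG|P|L0000001|PF|S0000001|Y|A0000001||||EXAMPLE|PT|X1|Example Disease|0|N||",
    "C0000001|ENG|S|L0000002|PF|S0000002|N|A0000002||||EXAMPLE|SY|X1|Example Disorder|0|N||",
    "C0000001|SPA|P|L0000003|PF|S0000003|Y|A0000003||||EXAMPLE|PT|X1|Enfermedad de ejemplo|0|N||",
    "C0000002|ENG|P|L0000004|PF|S0000004|Y|A0000004||||EXAMPLE|PT|X2|Example Drug|0|N||",
]

concept_names = names_by_cui(SAMPLE_ROWS)
```

With the real file, the same function can be given an open file handle (for example, `open("MRCONSO.RRF", encoding="utf-8")`), since it only iterates over lines and never loads the whole file at once.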
      
 
    
    Back to Top
    
      Ontologies and Controlled Vocabularies
    
    
      - 
        Disease Ontology -
        paper
        - An ontology of human diseases. Has cross-links to MeSH, ICD, NCI
        Thesaurus, SNOMED, and OMIM. Public domain. Available on
        GitHub
        and on the
        OBO Foundry.
      
 
      - 
        RxNorm
        -
        paper
        - Normalized names for clinical drugs and drug packs, with combined
        ingredients, strengths, and form, and assigned types from the Semantic
        Network (see below). Released monthly.
      
 
      - 
        SPECIALIST Lexicon
        -
        paper
        - A general English lexicon that includes many biomedical terms. Updated
        yearly since 1994 and still updated as of 2019. Part of UMLS but does
        not require UTS account to download.
      
 
      - 
        UMLS Metathesaurus
        -
        paper
        - Mappings between >3.8 million concepts, 14 million concept names,
        and >200 sources of biomedical vocabulary and identifiers. It’s big.
        It may help to prepare a subset of the Metathesaurus with the
        MetamorphoSys installation tool
        but we’re still talking about ~30 GB of disk space required for the 2019
        release.
        See the manual here. Requires UTS account.
      
 
      - 
        UMLS Semantic Network
        -
        paper
        - Lists of 133 semantic types and 54 semantic relationships covering
        biomedical concepts and vocabulary. Is the Metathesaurus too complex for
        your needs? Try this. Does not require UTS account to download.
      
 
    
    Back to Top
    Data Models
    
      Do you need a
      data model? If you
      are working with biomedical data, then the answer is probably “Yes”.
    
    
      - 
        Biolink -
        code - A data
        model of biological entities. Provided as a
        YAML file.
      
 
      - 
        BioUML -
        paper
        - An architecture for biomedical data analysis, integration, and
        visualization. Conceptually based on the visual modeling language
        UML.
      
 
      - 
        OMOP Common Data Model
        - a standard for observational healthcare data.
      
 
    
    Back to Top
    Credits
    Credits for curators and sources.
    License
    
      
    